Abstract

Although  incremental  learning  has  many  advantages  in  theory,  it  is  not  in  practice 
as  widely  used  as  non-incremental  learning  for  real-world  applications.  One  major 
reason  for  this  situation  is  the  lack  of  incremental  algorithms  that  can  perform  as  fast 
as non-incremental algorithms in general. In this paper, we present an effective yet
very efficient incremental algorithm CDL4 for learning decision lists whose complexity is
O(dn ), where d is the number of attributes and n the number of training instances. On
the  experiments  we  have  conducted,  CDL4’s  performance  is  as  fast  and  accurate  as  the 
best  non-incremental  learning  algorithms  for  batch  tasks,  and  is  much  faster  than  the 
best-known  incremental  and  non-incremental  learning  algorithms  for  serial  tasks.  We 
also  show  that  efficient  incremental  algorithms  can  provide  new  research  opportunities 
for learners to actively select training instances for better accuracy and higher speed,
and that incremental learning may eventually outperform non-incremental learning in
many aspects.


1  Introduction 


Incremental learning is a style of learning where the learner updates its model of the
environment (e.g., a hypothesis for an unknown concept) whenever a new significant experience
becomes available. Considerable research has been devoted to this style of learning,
including Samuel’s checkers player [1959], Winston’s algorithm [1975], Mitchell’s Candidate
Elimination Algorithm [1982], Fisher’s COBWEB [1987], Laird et al.’s SOAR [1986],
Sutton’s temporal-difference learning [1988], Utgoff’s ID5R [1989] and ITI [1993], Shen’s CDL1
[1990], CDL2 [1992], and D* [1993], and Rivest and Schapire’s algorithm for learning finite
state machines without resetting [1993].

Compared to non-incremental or batch learning, incremental learning has the advantages
of being more widely applicable and reactive. For example, it can be applied to situations
where input data come only in sequence and a promptly updated model is crucial for
actions. However, no existing incremental algorithms can perform as fast as non-incremental
algorithms in general, and this limits the practice of incremental learning in real-world
applications.

The  slowness  of  incremental  algorithms  stems  from  the  nature  of  incremental  learning. 
Unlike batch learning, incremental learning deals with theory revision at each time
step instead of theory creation from scratch. An incremental algorithm, without knowing any
training information that might become available in the future, must diagnose the deficiency
of its current theory and make the best choice in revising the theory. Neither of these tasks,
however, is a concern for non-incremental learning.

To  increase  the  efficiency  of  incremental  algorithms,  fast  methods  must  be  developed  for 
both  diagnosis  and  revision  tasks.  CDL4  achieves  its  speed  by  realizing  that  generalizing  a 
theory  is  equivalent  to  discriminating  its  complement,  thus  using  discrimination  as  a  single 
mechanism  for  both  theory  generalization  and  specialization.  When  using  CDL4  to  learn 
a  decision  list,  theory  diagnosis  becomes  detecting  incorrect  decisions  in  the  current  list, 
and  theory  revision  becomes  discriminating  an  incorrect  decision  using  the  differences  found 
between  a  misclassified  training  instance  and  the  accumulated  instances  of  that  decision. 
Compared to the earlier algorithms in this family, such as CDL1 and CDL2, CDL4 uses
a very efficient strategy to find the differences between instances and can deal with both
discrete and continuous attributes in a uniform fashion.^1

CDL4  has  been  applied  to  several  (very)  large  learning  tasks.  Experimental  results  have 
shown  that,  with  roughly  the  same  accuracy,  CDL4  is  much  faster  than  other  incremental 
algorithms (e.g., ITI) for serial tasks, and most surprisingly, is as fast as the best-known
non-incremental algorithms (e.g., C4.5) for batch tasks. Furthermore, in some learning tasks,
CDL4 achieves much higher accuracy than other algorithms when the training instances are
presented in certain orders. This phenomenon suggests that by arming themselves with a
strategy to choose training instances actively, incremental learning algorithms could
eventually outperform non-incremental algorithms in both speed and accuracy.

In the rest of this paper, Section 2 presents some key arguments for why incremental
learning is necessary. Section 3 gives a brief review of decision lists. Sections 4 and 5
describe the CDL4 algorithm in detail. Section 6 reports the experimental results of CDL4
in three learning tasks and provides a comparison with other algorithms. Section 7 analyzes
the complexity of CDL4. Section 8 examines the effects of training order on CDL4’s accuracy
and calls for new research in actively selecting training instances for incremental
learning algorithms. The paper concludes with a list of future research directions.

^1 CDL3 [Shen, 1994] is an algorithm that learns decision lists in the format of First-Order Logic.

2  The  Significance  of  Incremental  Learning 

What is incremental learning and why is it useful? In this section, we argue that in the
context of agents learning from their environment [Shen, 1994], incremental learning is the only
way to provide a continuously updated model of the environment for agents to intelligently
choose their actions (e.g., actively select training instances for concept learning). In many
situations, such an ability is crucial for the survival of the agents.

The task of learning a target concept $C$ from a sequence of training instances can be
summarized as follows: the learner is to construct a concept hypothesis $H_j$ to approximate
$C$ based on the training instances, $x_1, \ldots, x_j$, that have been seen so far. Such a task can be
accomplished by either batch (non-incremental) learning or incremental learning.

In batch learning, a hypothesis $H_j$ at time $j$ is created from scratch based on all
the instances from $x_1$ through $x_j$, without any intermediate hypothesis, as in the following
function:

$$H_j = \mathcal{F}_1[x_1, \ldots, x_j].$$

In incremental learning, on the other hand, $H_j$ is constructed recursively based on the
intermediate hypothesis $H_{j-1}$ and the current instance $x_j$, as follows:

$$H_i = \mathcal{F}_2[H_{i-1}, x_i], \quad \text{where } 1 \le i \le j, \text{ and } H_0 \text{ is given.}$$


Between these two styles of learning, empirical evidence has shown that $\mathcal{F}_1$ can be
computed much faster than $\mathcal{F}_2$, and that raises the question whether the latter is needed at all.
The three early arguments in favor of $\mathcal{F}_2$ are: the relief from keeping past instances or
equivalent information [Michalski, 1985; Schlimmer and Fisher, 1986], the availability of a continual
hypothesis for reacting to new stimuli [Schlimmer and Fisher, 1986], and the suitability for
serial tasks in which instances come only in sequence [Utgoff, 1989]. Unfortunately, the
strength of the first argument is weakening because of the low cost and high speed of modern
memory. The strength of the last two arguments is based on the assumption that there
is an efficient mechanism for updating the hypothesis incrementally, for otherwise one can
always use an $\mathcal{F}_1$ algorithm to build a new $H_j$ from scratch for every new instance $x_j$, or
accumulate enough instances from the stream and update the hypothesis periodically. Since
the most efficient and accurate concept learning algorithms today are of $\mathcal{F}_1$ type, doubts
remain whether $\mathcal{F}_2$ type of learning is indeed necessary.

Notice that all the earlier arguments for the $\mathcal{F}_2$ style of learning assume that the learners are
passive. They are given, either in batch or in sequence, the training instances to learn. In
many applications, however, the learner can or must actively select the next training instance.
For example, in language learning, a child or a system can ask specific questions. In learning
a finite state machine, the learner must act upon the machine. In learning grammars, the
learner must query the teacher for new strings and their memberships (e.g., [Angluin, 1987]).
In all these cases, the next training instance will not come unless the learner performs an
action.

Active learning has two important implications. First, choosing the next action imposes
a firm demand on a model that is continuously updated based on the learner’s experience in
the environment. It will be too costly to delay updating the model and wait for more instances
to accumulate. Second, choosing the “right” next instance may affect the performance of
learning dramatically. For example, in function approximation (a much more formal type
of “concept learning”), the samples used in interpolation must be selected according to
Chebyshev’s formula [Nikolskii, 1963] or else the approximation may not converge.

Having said that the most important aspect of incremental learning is to have a
continuously updated model for active learning, we may allow ourselves to use some memory in the
process of learning. In other words, we may consider the following $\mathcal{F}_3$ style

$$H_i = \mathcal{F}_3[H_{i-1}, \ldots, x_i], \quad \text{where } 1 \le i \le j, \text{ and } H_0 \text{ is given,}$$

also  as  incremental  learning  even  though  it  is  allowed  to  access  subsets  of  previous  instances. 
Such  a  style  of  learning  seems  justifiable  because  intelligent  creatures  do  have  memories,  and 
remembering  past  experience  can  be  very  useful  in  dealing  with  noise.  We  may  call  this  style 
of  learning  behavior-incremental  learning  for  it  does  incrementally  update  the  hypothesis  H 
at  each  step. 
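
As a rough illustration of the three update styles, the sketch below treats a hypothesis as an opaque value and each learner as a plain function. The names `batch_learn`, `incremental_learn`, and `behavior_incremental_learn`, as well as the toy "largest value seen so far" hypothesis, are hypothetical and are not part of CDL4. Passing the remembered instances to the $\mathcal{F}_3$ update is an assumption that follows the description above.

```python
from functools import reduce

def batch_learn(f1, instances):
    """F1 style: build H_j from scratch using all instances seen so far."""
    return f1(list(instances))

def incremental_learn(f2, h0, instances):
    """F2 style: H_i = f2(H_{i-1}, x_i); no past instances are kept."""
    return reduce(f2, instances, h0)

def behavior_incremental_learn(f3, h0, instances):
    """F3 style: like F2, but the update may also consult remembered instances."""
    h, memory = h0, []
    for x in instances:
        h = f3(h, x, memory)   # the hypothesis is still revised at every step
        memory.append(x)
    return h

# Toy hypothesis (assumption): the largest value seen so far.
f1 = lambda xs: max(xs)
f2 = lambda h, x: max(h, x)
f3 = lambda h, x, memory: max([h, x] + memory)

data = [3, 1, 4, 1, 5]
h_batch = batch_learn(f1, data)
h_incr = incremental_learn(f2, float("-inf"), data)
h_behav = behavior_incremental_learn(f3, float("-inf"), data)
```

In this toy setting all three styles reach the same final hypothesis; only the incremental styles expose an up-to-date hypothesis after every instance.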

Thus, the real question is not whether incremental learning is needed, but whether
we can develop efficient and accurate incremental learning algorithms to make the task of
continuously updating a model practical. The CDL4 algorithm is one such attempt.


3  Decision  Lists 

Before we present the CDL4 algorithm, let us first briefly review the notion of decision lists.
Proposed by Rivest [1987] for Boolean formulas, a decision list is a list $L$ of pairs

$$(f_1, v_1), \ldots, (f_r, v_r),$$

where each test function $f_j$ is a conjunction of literals, each $v_j$ is a value in $\{0,1\}$, and the
last function $f_r$ is the constant function true. For convenience, we shall refer to a pair $(f_j, v_j)$ as
the $j$-th decision, $f_j$ the $j$-th decision test, and $v_j$ the $j$-th decision value. In this paper, we
use an extended definition of decision lists in which $v_j$ is a value from a finite set of concept
names rather than from the binary set $\{0,1\}$, and $f_j$ is a conjunction of predicates.

Given a decision list $L$, the decision on an instance $x$ is defined to be $D_j = (f_j, v_j)$, where
$j$ is the least index such that $f_j(x) = 1$. In this case, we say that $x$ is covered by $D_j$, that $D_j$
is the covering decision of $x$, and that $x$ is classified by $L$ as an instance of $v_j$. For example,
given the following decision list:

$$L_1 = (((a_3 < 1000.0) \wedge (a_4 \ne \text{‘u’}), C_1), ((a_1 < 980.0), C_2), (\text{true}, C_1))$$

where $a_i$ are attribute names and $C_i$ are concept names, and an instance $x$ = [($a_1$=300.0)
($a_2$=‘g’) ($a_3$=100.0) ($a_4$=‘u’)], then this instance $x$ is covered by the second decision (($a_1$ <
980.0), $C_2$) and it is classified as an instance of $C_2$.
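
The lookup of the covering decision can be sketched in a few lines. This is only an illustration: the representation of a test as a list of `(attribute, comparison, constant)` triples and the function name `covering_decision` are assumptions, and the list `L1` encodes the example above under the reading that its first test is $(a_3 < 1000.0) \wedge (a_4 \ne \text{‘u’})$.

```python
import operator

# Each decision is (test, value); a test is a conjunction of predicates.
L1 = [
    ([("a3", operator.lt, 1000.0), ("a4", operator.ne, "u")], "C1"),
    ([("a1", operator.lt, 980.0)], "C2"),
    ([], "C1"),   # the empty conjunction plays the role of the constant test `true`
]

def covering_decision(dlist, x):
    """Return (index, value) of the first decision whose test holds on x."""
    for j, (test, value) in enumerate(dlist):
        if all(compare(x[attr], const) for attr, compare, const in test):
            return j, value
    raise ValueError("a decision list must end with the constant test `true`")

x = {"a1": 300.0, "a2": "g", "a3": 100.0, "a4": "u"}
index, label = covering_decision(L1, x)   # index is 0-based
```

The first test fails on `x` because its `a4` value is ‘u’, so the search falls through to the second decision.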


4  The  CDL4  Algorithm 

CDL4 is an $\mathcal{F}_3$ style algorithm whose goal is to construct incrementally a decision list $L$ from
a sequence of training instances under the supervision of a teacher. When a new instance $x$
is misclassified by $L$ (i.e., the classification made by $L$ does not match the classification, say
$C_x$, given by the teacher), CDL4 revises $L$ to classify $x$ correctly. To accomplish this task,
the algorithm must

1. Determine whether there is a misclassification and which decision $D_j$ in $L$ is responsible;

2. Determine to what extent $D_j$ should be discriminated;

3. Discriminate $D_j$ and modify $L$.

For the first task, CDL4 searches through the decision list and returns the decision
$D_j = (f_j, v_j)$, where $j$ is the least index such that $f_j(x) = 1$. If $v_j$ is not equal to $x$’s
classification provided by the teacher, then $f_j$ is the decision test that needs to be discriminated.
Otherwise, the instance $x$ is recorded as an example of $D_j$.

If  the  new  instance  is  misclassified  by  the  current  decision  list,  then  CDL4  performs  the 
second  task  by  finding  out  the  minimal  differences  between  the  new  instance  x  and  the  set 
E  of  previous  examples  of  Dj  (recall  that  an  example  of  a  decision  is  an  instance  recorded 
under  that  decision).  The  key  idea  here  is  to  first  construct  a  differentiator  predicate  for 
each  attribute  (except  those  that  have  unknown  values  in  the  new  instance),  and  then  select 
these  differentiators  one  at  a  time,  according  to  their  strengths  to  distinguish  x  from  E,  until 
all  (or  most  of)  the  examples  in  E  are  differentiated  from  x.  The  final  result  is  a  disjunction 
of  a  set  of  differentiators,  each  of  which  is  true  for  some  examples  in  E  but  false  for  the  new 
instance  x. 

Given a new instance $x$ and a set of previous examples $E$, the differentiators are
constructed as follows. If an attribute $a_k$ is discrete, then the corresponding differentiator is
constructed as a predicate

$$\sigma_k(y) = (a_k \ne v_k)$$

where $y$ is a parameter to be bound to an instance (or an example), $a_k$ is an attribute variable
for $y$, and $v_k$ is a constant equal to the value of $a_k$ in the new instance $x$. For example, if the
new instance $x$ is [..., ($a_4$=‘g’)], then the differentiator $\sigma_4(y)$ is defined as ($a_4 \ne$ ‘g’). This
differentiator is true for those examples in $E$ whose $a_4$ value is not equal to ‘g’, and is false
for the instance $x$.

If an attribute $a_k$ is continuous, then CDL4 scans the attribute values, represented as
$e_{a_k}$, in the example set $E$, and constructs the corresponding attribute differentiator as

$$\sigma_k(y) = \begin{cases} (a_k < v_k) & \text{if more than half of the values } e_{a_k} \text{ in } E \text{ are less than } v_k \\ (a_k > v_k) & \text{otherwise} \end{cases}$$

where $y$ and $v_k$ are defined as before. Again, this differentiator is true for some of the
examples in $E$ and is false for the instance $x$. The motivation of this definition is to construct
a differentiator that can distinguish as many examples in $E$ as possible from the new instance
$x$. Since differentiators are guaranteed to be false on the new instance $x$, their strengths are
computed as the number of examples in $E$ for which they return true.

To illustrate the entire step of finding differences between $x$ and $E$, suppose a decision
has three examples: [300.0,g,100.0,u], [300.0,g,1000.0,g], [1000.0,g,1200.0,g], and the new
instance is [980.0,g,1000.0,u]. In this case, the differentiators are defined as $\sigma_1(y)$ = ($a_1$ <
980.0) with strength 2, $\sigma_2(y)$ = ($a_2 \ne$ ‘g’) with strength 0, $\sigma_3(y)$ = ($a_3$ > 1000.0) with
strength 1, and $\sigma_4(y)$ = ($a_4 \ne$ ‘u’) with strength 2. The minimal differences are returned as
(($a_1$ < 980.0), ($a_4 \ne$ ‘u’)) because these two differentiators have higher strengths and they
can distinguish all three examples from the new instance.
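
The difference-finding step can be sketched as follows. This is a minimal illustration under stated assumptions, not the original C implementation: instances are tuples, `continuous` is a hypothetical set of indices of continuous attributes, unknown values are represented by `None`, and differentiators are considered in order of decreasing strength, keeping each one that distinguishes at least one example not yet distinguished. The text only says that differentiators are selected "according to their strengths", so the exact tie-breaking rule here is a guess.

```python
import operator

def build_differentiators(x, examples, continuous):
    """Return one (attribute index, comparison, constant) per known attribute of x."""
    sigmas = []
    for k, v in enumerate(x):
        if v is None:                       # unknown value in x: no differentiator
            continue
        if k in continuous:
            below = sum(1 for e in examples if e[k] < v)
            compare = operator.lt if below > len(examples) / 2 else operator.gt
        else:
            compare = operator.ne
        sigmas.append((k, compare, v))
    return sigmas

def covered(sigma, examples):
    """Indices of the examples on which the differentiator is true."""
    k, compare, v = sigma
    return {i for i, e in enumerate(examples) if compare(e[k], v)}

def minimal_differences(x, examples, continuous):
    sigmas = build_differentiators(x, examples, continuous)
    ranked = sorted(sigmas, key=lambda s: len(covered(s, examples)), reverse=True)
    chosen, done = [], set()
    for s in ranked:
        new = covered(s, examples) - done
        if new:                             # keep only differentiators that help
            chosen.append(s)
            done |= new
        if len(done) == len(examples):
            break
    return chosen

E = [(300.0, "g", 100.0, "u"), (300.0, "g", 1000.0, "g"), (1000.0, "g", 1200.0, "g")]
x = (980.0, "g", 1000.0, "u")
strengths = [len(covered(s, E)) for s in build_differentiators(x, E, continuous={0, 2})]
chosen = [(k, v) for k, _, v in minimal_differences(x, E, continuous={0, 2})]
```

On the example above, `strengths` reproduces the values 2, 0, 1, 2 and `chosen` contains the differentiators on the first and fourth attributes (indices 0 and 3).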

After the differences between a misclassified instance and the previous examples of the
covering decision $D_j$ are found, the third task is to discriminate the covering decision and
modify the decision list. For this task, CDL4 replaces $D_j = (f_j, v_j)$ by a new decision
$D_j' = (f_j \wedge S, v_j)$, where $S$ is the differences found in the second step. Since $S$ is a disjunction
of differentiators $\sigma_1, \ldots, \sigma_d$, the new decision $D_j'$ can be represented as a list of new decisions:

$$(f_j \wedge \sigma_1, v_j), \ldots, (f_j \wedge \sigma_d, v_j).$$

Note that none of the new decisions will capture the new instance. The previous examples
of the old decision $D_j$ are then distributed to these new decisions in the following manner:
an example $e$ is distributed to $f_j \wedge \sigma_i$, where $i$ is the least index, $1 \le i \le d$, such that
$[f_j \wedge \sigma_i](e) = 1$.

After the incorrect decision is replaced, the new instance $x$ continues to look for a decision
in the remainder of the old decision list. Suppose the new covering decision is $D_k$, where
$k \ge j + d$. If $v_k = C_x$ (i.e., the decision is correct), then the instance is recorded as an
example of $D_k$. Otherwise, $D_k$ is discriminated just as $D_j$ was. This process continues until
the instance finds a decision with the correct value. If the instance reaches the end of the
list, i.e., $D_k = (\text{true}, v_k)$, then either $v_k = C_x$ and $x$ can be added to $D_k$’s example list, or
$D_k$ is discriminated and a new default decision (true, $C_x$) is appended at the end of the list
with $x$ as its only example. The pseudocode of the CDL4 algorithm is listed in Figure 1.

To illustrate the algorithm, let us go through an example to see how CDL4 learns a
decision list representing the following binary concept: $C = \bar{a}_1\bar{a}_3 \vee a_1a_2 \vee a_1\bar{a}_2a_4$. We assume
there are five attributes $a_1, a_2, a_3, a_4$, and $a_5$, and use $a_i$, where $i = 1, \ldots, 5$, to represent
$(a_i = 1)$, and $\bar{a}_i$ to represent $(a_i \ne 1)$ or $(a_i = 0)$. The training instances are given from
$x_0 = [\bar{a}_1\bar{a}_2\bar{a}_3\bar{a}_4\bar{a}_5]$ to $x_{31} = [a_1a_2a_3a_4a_5]$ in that order.

When the first training instance $x_0 = [\bar{a}_1\bar{a}_2\bar{a}_3\bar{a}_4\bar{a}_5]$ ($\in C$) arrives, CDL4 initiates a decision
list with a single “default” decision:

((true, $C$)),

and the instance is recorded as an example of this sole decision. Since the next three instances
$x_1 = [\bar{a}_1\bar{a}_2\bar{a}_3\bar{a}_4a_5]$, $x_2 = [\bar{a}_1\bar{a}_2\bar{a}_3a_4\bar{a}_5]$, and $x_3 = [\bar{a}_1\bar{a}_2\bar{a}_3a_4a_5]$ are all predicted correctly,
they also become examples of the default decision. The fifth instance is $x_4 = [\bar{a}_1\bar{a}_2a_3\bar{a}_4\bar{a}_5]$
($\in \bar{C}$), and CDL4’s prediction is wrong. The difference between $(x_0, x_1, x_2, x_3)$ and $x_4$ is
found to be $(\bar{a}_3)$, so the decision is shrunk to be $(\bar{a}_3, C)$, and a new default (true, $\bar{C}$) (with
$x_4$ as its example) is appended at the end. The new decision list is now:

$((\bar{a}_3, C)(\text{true}, \bar{C}))$.


Inputs:  A decision list $D$ for a set of unknown target concepts $C_1, \ldots, C_u$,
         a new instance $x$ of one of the targets $C_x$,
         and a set $E$ of previous examples of these targets;

Outputs: A refined decision list $D$;

Procedure CDL4($D$, $E$, $x$):

  If $D$ is empty,
  then $D = (D_0) = ((\text{true}, C_x))$ and record $x$ as an example of $D_0$,
  else loop:
    Let $D_j = (f_j, v_j)$ be the covering decision on $x$,
    If the decision is correct, i.e., $v_j = C_x$,
    then record $x$ as an example of $D_j$ and exit the loop,
    else let $S = (\sigma_1, \ldots, \sigma_d)$ be the differences found between previous examples of $D_j$ and $x$,
      Replace $(f_j, v_j)$ by $(f_j \wedge \sigma_1, v_j), \ldots, (f_j \wedge \sigma_d, v_j)$,
      Distribute the examples of $D_j$ to the new decisions,
      If $D_j$ was the last decision, then append (true, $C_x$) at the end of $D$.

Figure  1:  The  CDL4  Incremental  Learning  Algorithm 
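
The procedure in Figure 1 can be sketched in Python for discrete attributes. This is a minimal illustration under several assumptions, not the author's C implementation: every predicate has the form $(a_k \ne v)$, differentiators are tried in order of decreasing strength, the data are noise-free, and all names (`Decision`, `find_differences`, `cdl4_update`, `classify`, `target`) are hypothetical. The demonstration concept `target` is our reading of the five-attribute worked example, with $x_0$ as the all-zero instance and $a_1$ as the most significant bit.

```python
from itertools import product

class Decision:
    def __init__(self, tests, value):
        self.tests = tests        # conjunction of (k, v) pairs, each meaning x[k] != v
        self.value = value
        self.examples = []        # instances recorded under this decision

    def covers(self, x):
        return all(x[k] != v for k, v in self.tests)

def find_differences(x, examples):
    """Pick differentiators (k, x[k]) by decreasing strength until E is covered."""
    def hits(k):
        return {i for i, e in enumerate(examples) if e[k] != x[k]}
    ranked = sorted(range(len(x)), key=lambda k: len(hits(k)), reverse=True)
    chosen, done = [], set()
    for k in ranked:
        new = hits(k) - done
        if new:
            chosen.append((k, x[k]))
            done |= new
    if len(done) < len(examples):
        raise ValueError("an identical example carries a different label (noise)")
    return chosen

def classify(dlist, x):
    return next(d.value for d in dlist if d.covers(x))

def cdl4_update(dlist, x, label):
    if not dlist:                                   # D is empty: start with (true, C_x)
        dlist.append(Decision([], label))
        dlist[0].examples.append(x)
        return
    while True:
        j = next(i for i, d in enumerate(dlist) if d.covers(x))
        dj = dlist[j]
        if dj.value == label:                       # correct decision: record and stop
            dj.examples.append(x)
            return
        new = [Decision(dj.tests + [s], dj.value)
               for s in find_differences(x, dj.examples)]
        for e in dj.examples:                       # least-index distribution
            next(n for n in new if n.covers(e)).examples.append(e)
        was_last = (j == len(dlist) - 1)
        dlist[j:j + 1] = new
        if was_last:                                # append the new default (true, C_x)
            default = Decision([], label)
            default.examples.append(x)
            dlist.append(default)
            return

def target(x):
    a1, a2, a3, a4, _ = x
    return (not a1 and not a3) or (a1 and a2) or (a1 and not a2 and a4)

dlist = []
instances = list(product((0, 1), repeat=5))         # x_0 ... x_31
for x in instances:
    cdl4_update(dlist, x, "C" if target(x) else "not C")
```

After one pass over the 32 instances, the learned list classifies every instance in agreement with `target`. The order of decisions inside the list can depend on how ties between equally strong differentiators are broken, so it need not match the list printed in the text decision for decision.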

With this decision list, the succeeding instances are predicted correctly until $x_{16} = [a_1\bar{a}_2\bar{a}_3\bar{a}_4\bar{a}_5]$.
CDL4’s decision is $(\bar{a}_3, C)$ but the instance belongs to $\bar{C}$. Comparing the examples of the
decision with the new instance, CDL4 finds the difference to be $\bar{a}_1$. The decision is then
replaced by $(\bar{a}_1\bar{a}_3, C)$:

$((\bar{a}_1\bar{a}_3, C)(\text{true}, \bar{C}))$.

The troublesome instance then finds (true, $\bar{C}$) to be correct and becomes an example of the
default decision. For succeeding instances, CDL4’s prediction is correct on $x_{17} = [a_1\bar{a}_2\bar{a}_3\bar{a}_4a_5]$
but wrong on $x_{18} = [a_1\bar{a}_2\bar{a}_3a_4\bar{a}_5]$. The decision is (true, $\bar{C}$) but $x_{18}$ is in $C$. The differences
found are $(\bar{a}_4, \bar{a}_1)$, so the decision (true, $\bar{C}$) is replaced by $((\bar{a}_4, \bar{C}), (\bar{a}_1, \bar{C}), (\text{true}, C))$,
yielding:

$((\bar{a}_1\bar{a}_3, C)(\bar{a}_4, \bar{C})(\bar{a}_1, \bar{C})(\text{true}, C))$.

With this decision list, CDL4 correctly classifies the next five instances, $x_{19}$ through $x_{23}$, but
fails on $x_{24} = [a_1a_2\bar{a}_3\bar{a}_4\bar{a}_5]$. The decision is $(\bar{a}_4, \bar{C})$ but the instance is in $C$. The difference
between this decision’s examples $(x_{16}, x_{17}, x_{20}, x_{21})$ and $x_{24}$ is found to be $(\bar{a}_2)$. The decision
$(\bar{a}_4, \bar{C})$ is then shrunk into $(\bar{a}_4\bar{a}_2, \bar{C})$ and the new decision list is now:

$((\bar{a}_1\bar{a}_3, C)(\bar{a}_4\bar{a}_2, \bar{C})(\bar{a}_1, \bar{C})(\text{true}, C))$

This decision list is equivalent to the target concept $C$, and it correctly classifies all the
remaining instances $x_{25}$ through $x_{31}$.

5  Improvements  of  CDL4 

The CDL4 algorithm can be improved in several ways. The first is to construct decision
lists that are shorter. CDL4 does not guarantee learning the shortest decision list,
and the length of the final decision list depends on the order of the training instances. If the
instances that are more representative (i.e., that represent the critical differences between
target concepts) appear earlier, then the decision list will be shorter. Although
learning the shortest decision list is an NP-complete problem [Rivest, 1987], there are some
strategies that help keep the list as short as possible.

Every time a decision is replaced by a list of new decisions, we can check to see if any of
these new decisions can be merged with any of the decisions with greater indices. A merger
of two decisions $(f_i, v)$ and $(f_j, v)$ is defined to be $(f_m, v)$, where $f_m$ is a conjunction of those
predicates that appear in both $f_i$ and $f_j$. (Conceptually, $f_m$ covers at least the union of $f_i$
and $f_j$.) Two decisions $D_i = (f_i, v_i)$ and $D_j = (f_j, v_j)$ $(i < j)$ can be merged if the following
conditions are met: (1) The two decisions have the same value, i.e., $v_i = v_j$; (2) None of
the examples of $D_i$ is captured by any decisions between $i$ and $j$ that have different decision
values; (3) The merged decision $(f_m, v_j)$ does not block any examples of any decisions after
$j$ that have different values.

To  illustrate  the  idea,  consider  the  following  example.  Suppose  the  current  decision  list 
is  ((as,  C')(ai,C')(tr«e,  (7)),  and  examples  of  these  three  decisions  are  {^102^3],  [010203]}, 
{[010203],  [010203]},  and  {[01O2O3]},  respectively.  Suppose  the  current  instance  is  a:  =  [010203] 
(€  C),  for  which  CDL4  has  made  the  wrong  decision  (03,(7).  Since  the  difference  between 
{[010203],  [010203]},  (the  examples  of  Di)  and  x  is  (oi),  the  decision  (o3,0  should  be 
replaced  by  (0301,(7),  which  would  result  in  a  new  decision  list  ((0^301,  (7)(ai,  (7)(true,  (7)). 
However,  the  new  decision  (0301,(7)  can  be  merged  with  {true,^)  because  it  has  the  same 
decision  value,  and  none  of  its  examples  can  be  captured  by  (oi,  (7),  and  the  merged  decision, 
which  is  {true,  (7),  does  not  block  any  other  decisions  following  it.  Thus,  the  decision  list  is 
shortened  to:  ((oi,  C){true,  C)). 

The  second  improvement  over  the  basic  CDL4  algorithm  is  to  deal  with  instances  that 
belong  to  different  concepts  simultaneously.  (This  is  sometimes  called  the  noise  in  the 
training  data.)  To  handle  such  instances,  we  relax  the  criterion  for  a  correct  decision  to  be: 

A decision $D_j = (f_j, v_j)$ is correct on an instance $x$ that has concept value $C_x$ if
either $v_j = C_x$, or $x$ is already an example of $D_j$.

For example, suppose a decision $(a_1a_2, C)$ currently has one example $\{[a_1a_2] \in C\}$ and
a new instance $[a_1a_2]$ arrives and claims to be in $\bar{C}$; then the decision $(a_1a_2, C)$ will be
considered to be correct because $[a_1a_2]$ is already in its example set. With this new criterion,
a decision may have duplicate examples. Examples that belong to the same decision may
have the same instance with different concept values, and the value of the decision may
be inconsistent with some of its examples. To deal with this problem, we adopt a policy
that the value of a decision is always the same as the concept value that is supported by
the most examples. For instance, if another example $[a_1a_2] \in \bar{C}$ is claimed by the above
decision again, then the value of the decision will be changed from $C$ to $\bar{C}$ because $\bar{C}$ will
be supported by two examples vs. only one example for $C$. With this policy, we must also
relax  the  criteria  for  finding  differences.  Instead  of  returning  a  set  of  differentiators  that 
can  distinguish  all  previous  examples  of  a  decision  from  the  new  instance,  one  may  find  it  is 
enough  to  distinguish  most  previous  examples  (e.g.,  above  a  user-specified  threshold). 
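
The relaxed correctness criterion and the majority policy can be sketched as below. This is an illustrative sketch only: the names `is_correct` and `decision_value` are hypothetical, examples are stored as `(instance, label)` pairs, and a tie is resolved in favour of the label seen first, which is an assumption the text does not spell out.

```python
from collections import Counter

def is_correct(value, examples, x, label):
    """Relaxed criterion: the value matches, or x is already an example of the decision."""
    return value == label or any(inst == x for inst, _ in examples)

def decision_value(examples):
    """Concept value supported by the most examples (ties keep the first label seen)."""
    return Counter(lbl for _, lbl in examples).most_common(1)[0][0]

# One decision with a single example labelled "C"; the same instance then
# arrives twice with the conflicting label "not C".
examples = [((1, 1), "C")]
value = decision_value(examples)
x = (1, 1)
history = []
for claimed in ("not C", "not C"):
    accepted = is_correct(value, examples, x, claimed)
    examples.append((x, claimed))
    value = decision_value(examples)
    history.append((accepted, value))
```

In this run the decision keeps its value after the first conflicting example and switches only once the conflicting label is supported by two examples against one.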


6  Experiments  with  CDL4 

In  this  section,  CDL4  is  compared  with  three  existing  algorithms:  CN2  [Clark  and  Niblett, 
1989],  C4.5  [Quinlan,  1993],  and  ITI  [Utgoff,  1993],  on  three  (very)  large  learning  tasks. 
These  algorithms  represent  a  wide  range  of  non-incremental  or  incremental  algorithms  and 
different  concept  representations  such  as  decision  trees  and  classification  rules.  CN2  is  chosen 
for  its  ability  to  learn  ordered  (and  unordered)  decision  rules,  which  are  very  similar  to 
decision  lists  learned  by  CDL4.  C4.5  is  chosen  for  its  high  performance  for  learning  decision 
trees  and  its  extension  C4.5rules  to  learn  unordered  decision  rules.  Both  CN2  and  C4.5  are 
well-known non-incremental algorithms. For incremental algorithms, we have chosen ITI^2,
which is a descendant of ID5R and the newest and perhaps the best algorithm that learns
decision trees incrementally. It is also known that for serial tasks, ITI is faster than rerunning
C4.5 on the new training set each time a new instance is accumulated. All these algorithms,
including CDL4, are implemented in the C language and run on a SPARC-20 machine
with 128M of main memory. Furthermore, all source code was obtained directly from the
original authors, and this author did not reimplement anything.

The  experiments  chosen  here  represent  several  different  flavors  of  learning  tasks.  The 
first one is to learn a concept of “win” or “lose” from a set of 551 chess end games that
are represented by 39 binary attributes. Although this task does not have a large size and
contains no noise, it is chosen for historical reasons. (The task was first proposed by Quinlan
[1983] and subsequently used in many incremental learning papers.) The second task is to
recognize  hand-written  numerals  (from  “0”  to  “9”)  from  bit  maps.  The  size  of  this  task 
is  relatively  large  (3301  bit  maps  each  having  64  bits),  and  there  is  noise  in  the  data  (i.e., 
two  bit  maps  that  look  the  same  are  labeled  as  different  numerals).  Finally,  the  third  task 
is  to  learn  to  recognize  different  font  letters  (from  “A”  to  “Z”)  that  are  represented  as  16 
attributes.  The  size  of  this  task  is  large,  including  20,000  instances,  and  its  attributes  have 
continuous  values.  In  fact,  this  is  the  largest  learning  task  we  found  in  the  ML  Repository 
maintained  at  the  University  of  California,  Irvine  (ml-repository@ics.uci.edu). 

The results of these three experiments are reported in Tables 1, 2, and 3. The format
of the tables is the same: the first column is the name of the algorithm, along with an
indication whether the training instances are presented as a batch or a sequence (serial).
The second column states, for serial tasks, whether the order of the training instances is given
as it is or chosen by hand. The third column gives the average accuracies (with standard
deviations) of learned concepts on the unseen testing data. The fourth column is the average
CPU time taken in the cross-validation learning (testing time is not included). Finally,
the last column is an indication of the size of the learned concept. If the learned concept
is a decision list (or a set of rules in the case of CN2 and C4.5rules), then the value in this
column is the average number of decisions or rules in the decision list or the rule set. If the
concept is a decision tree (in the case of C4.5 with pruning and ITI), then the value is the
average number of nodes in the decision tree. Note that the number of nodes in decision
trees cannot be directly compared with the number of rules; they are included here only for
completeness.


^2 Both ITI and ID5R are $\mathcal{F}_3$ style algorithms.


Table  1:  Comparison  on  classifying  chess  end  games  (551x39) 


Program (Mode)      Training Order   Accuracy          CPU sec.   Concept Size
C4.5 (batch)        N/A              91.10% (3.67%)        0.30           46.6
C4.5rules (batch)   N/A              92.90% (3.40%)        3.50           21.4
CN2 (batch)         N/A              84.21% (6.85%)        7.70           17.0
ITI (batch)         N/A              91.96% (2.29%)        2.23          142.8
ITI (serial)        Insensitive      91.96% (2.29%)       95.88          142.8
CDL4 (serial)       Given            87.66% (2.82%)        0.31           76.4
CDL4 (serial)       Choose           96.19% (2.36%)        0.24           41.7

6.1  Classifying  Chess  End  Games 

In this experiment, each algorithm performs a 10-run cross-validation, using a randomly
selected 90% of the data for training and the remaining 10% for testing. The results in Table 1
indicate that the speed of CDL4 is almost the same as that of the best non-incremental algorithm,
C4.5 in batch mode (0.31 vs. 0.30), and that CDL4 is much faster than the other non-incremental
and incremental algorithms. When the training orders are given as those selected by the
cross-validation, CDL4’s prediction accuracy is not the best, but it is better than that of the
non-incremental algorithm CN2. However, if a good training order is chosen for each training set,
then CDL4 is faster (0.24) and more accurate (96.19%) than the other algorithms. In this
experiment, a good training order is determined as follows. We order the training instances by
the values of each attribute (in both increasing and decreasing order), and among these orders
we select the one that gives the best test accuracy. Note that there are $N!$ possible orders for
a training set of $N$ instances, and we have considered only $2 \times d$ of them in this
experiment, where $d$ is the number of attributes. In general, we have not yet found a systematic
way to select a good training order (see more discussion in Section 8).
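The order-selection procedure described above can be sketched as follows. This is only an illustrative sketch, not the implementation used in the experiments: the names `candidate_orders` and `best_order`, the tuple representation of an instance, and the `evaluate` callback (which is assumed to train a learner on a given order and return an accuracy) are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed representation: (attribute values, class label).
Instance = Tuple[Sequence[float], str]


def candidate_orders(instances: List[Instance]) -> List[List[Instance]]:
    """Return the 2 x d orderings obtained by sorting on each attribute,
    once in increasing and once in decreasing order."""
    if not instances:
        return []
    d = len(instances[0][0])  # number of attributes
    orders = []
    for a in range(d):
        for descending in (False, True):
            # sorted() is stable, so ties keep their original relative order.
            orders.append(sorted(instances, key=lambda inst: inst[0][a],
                                 reverse=descending))
    return orders


def best_order(instances: List[Instance],
               evaluate: Callable[[List[Instance]], float]) -> List[Instance]:
    """Pick the candidate order with the highest score.

    `evaluate` is a hypothetical callback: it trains on the given order
    and returns the resulting accuracy."""
    return max(candidate_orders(instances), key=evaluate)
```

Only 2 x d of the possible orders are generated, so the cost of the search is 2 x d training runs rather than one run per permutation.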


6.2  Recognizing  Hand-Written  Numerals

In this experiment, each algorithm again performs a 10-run cross-validation, using a randomly
selected 90% of the data for training and the remaining 10% for testing. As shown in Table 2,
CDL4’s speed (15.87 or 14.06) is again close to that of the best non-incremental algorithm (5.20
for C4.5) in batch mode and much faster than serial ITI (1578.49). When the training orders are
given as those selected by the cross-validation, CDL4’s prediction accuracy (75.07%) is lower
than that of the others in this domain. However, when good training orders are chosen, its
accuracy is 77.91%, again higher than the others. As in the chess domain, the good orders are
selected from among those orders based on the values of attributes (there are 64x2 such orders
in this domain). In addition, we have analyzed the error matrix (how often each class is
misclassified as another) and considered some training orders based on class values that are
“similar.” For example, instances of “3” and “5” and instances of “8” and “9” are presented
adjacently in such selected training orders.
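The error-matrix analysis mentioned above can be illustrated with a small sketch. It is one plausible reading of the idea, not the procedure actually used in the experiments: the function names, the dictionary form of the error matrix, and the greedy way of placing confused classes next to each other are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def most_confused_pairs(errors: Dict[Tuple[str, str], int],
                        k: int) -> List[Tuple[str, str]]:
    """errors[(true, predicted)] counts instances of class `true` that were
    labelled `predicted`. Return the k unordered class pairs with the
    largest combined confusion."""
    combined: Dict[Tuple[str, str], int] = {}
    for (true, predicted), count in errors.items():
        if true == predicted:
            continue  # correct predictions are not errors
        first, second = sorted((true, predicted))
        combined[(first, second)] = combined.get((first, second), 0) + count
    ranked = sorted(combined, key=lambda p: (-combined[p], p))
    return ranked[:k]


def class_sequence(classes: List[str],
                   pairs: List[Tuple[str, str]]) -> List[str]:
    """Place the members of each confused pair consecutively (greedy; a class
    shared by two pairs is placed only once), then append the other classes."""
    placed: List[str] = []
    for pair in pairs:
        for cls in pair:
            if cls not in placed:
                placed.append(cls)
    for cls in classes:
        if cls not in placed:
            placed.append(cls)
    return placed


def order_instances(instances: List[Tuple[Sequence[float], str]],
                    sequence: List[str]) -> List[Tuple[Sequence[float], str]]:
    """Stable-sort instances so that classes appear in the given sequence."""
    rank = {cls: i for i, cls in enumerate(sequence)}
    return sorted(instances, key=lambda inst: rank[inst[1]])
```

With a hypothetical error matrix in which “3”/“5” and “8”/“9” are the most confused pairs, `class_sequence` places those classes side by side, so their instances are presented adjacently in the resulting training order.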


Table 2: Comparison on the numeral recognition task (3331x64)

Program (Mode)       Training Order   Accuracy           CPU sec.   Concept Size
C4.5 (batch)         N/A              77.70% (2.88%)     5.20       548.2
C4.5rules (batch)    N/A              76.10% (3.01%)     342.27     124.0
CN2 (batch)          N/A              77.89% (1.61%)     161.40     142.0
ITI (batch)          N/A              77.31% (3.09%)     39.81      1182.20
ITI (serial)         Insensitive      77.31% (3.09%)     1578.49    1182.20
CDL4 (serial)        Given            75.16% (3.51%)     15.87      547.4
CDL4 (serial)        Choose           77.91% (1.73%)     14.06      438.4

Table 3: Comparison on the letter recognition task (20,000x16)

Program (Mode)       Training Order   Accuracy           CPU sec.   Concept Size
C4.5                 N/A              87.20% (0.49%)     125.30     2199.0
C4.5rules            N/A              85.40% (0.49%)     26552.10   570.0
CN2                  N/A              87.94% (0.89%)     8456.36    480.0
ITI (batch)          N/A              86.84% (0.48%)     129.60     4115.0
ITI (serial)         Insensitive      86.84% (0.48%)     31169.42   4115.0
CDL4 (serial)        Given            80.92% (0.25%)     123.02
CDL4 (serial)        Choose           84.71% (0.86%)     120.20     2101.4

6.3  Recognizing  Letters  in  Different  Fonts 

In this very large learning task (20,000 instances), each algorithm performs a 5-run
cross-validation, using a randomly selected 80% of the data for training and the remaining 20%
for testing. As shown in Table 3, CDL4’s speed (123.02 and 120.20) is again very close to that
of the best non-incremental algorithm, C4.5 in batch mode (125.30), and much faster than
the other non-incremental algorithms in batch mode (26552.10 for C4.5rules and 8456.36 for
CN2) and the other incremental algorithm in serial mode (31169.42 for ITI). CDL4’s prediction
accuracy is behind the others in both cases (80.92% for the given training orders and 84.71% for
the orders selected using the same method as in the domain of hand-written numerals), but it is
not too far from the best result of 87.94%, obtained by CN2. For reference, the best known
result in this domain is 95.7%, obtained by an algorithm using a nearest-neighbor classifier
[Fogarty, 1992]. Since that algorithm does not build descriptions of concepts, it is not
included in this paper. Notice again that in selecting good training orders we considered only
a very small subset of the possible orders.
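The evaluation protocol used in Sections 6.1 to 6.3 (repeated random splits, with accuracy averaged over runs and only learning time counted) can be sketched roughly as below. This is a sketch under stated assumptions: `repeated_holdout`, `learn`, and `accuracy` are hypothetical names, the sample standard deviation is assumed because the text does not say which variant was reported, and `time.process_time` is used as a stand-in for the CPU timing in the tables.

```python
import random
import statistics
import time
from typing import Callable, List, Sequence, Tuple


def repeated_holdout(data: Sequence, train_fraction: float, runs: int,
                     learn: Callable[[List], object],
                     accuracy: Callable[[object, List], float],
                     seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean accuracy, standard deviation, mean CPU seconds of learning).

    Requires runs >= 2, since a sample standard deviation is computed."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    scores: List[float] = []
    cpu: List[float] = []
    for _ in range(runs):
        shuffled = list(data)
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train, test = shuffled[:cut], shuffled[cut:]
        start = time.process_time()
        model = learn(train)  # only learning is timed, not testing
        cpu.append(time.process_time() - start)
        scores.append(accuracy(model, test))
    return statistics.mean(scores), statistics.stdev(scores), statistics.mean(cpu)
```

Under these assumptions, the 10-run 90%/10% setting corresponds to `repeated_holdout(data, 0.9, 10, learn, accuracy)` and the 5-run 80%/20% setting to `repeated_holdout(data, 0.8, 5, learn, accuracy)`.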


7  Complexity  Analysis  of  CDL4 

This section analyzes the complexity of CDL4. Let $d$ be the number of attributes, $m$ the
number of previous examples, and $c$ the length of the current decision list. We first consider
how many comparison operations CDL4 needs to revise the current decision list for
a new instance $x$. Here, we assume that $x$ causes one and only one decision to be split.

Observe that the number of comparisons needed to find a covering decision $D_j$ for $x$ is at
most $cd$, because the size of each decision test is no greater than $d$. Assume that the $m$
examples are evenly distributed among the $c$ decisions. Then the faulty decision $D_j$ has $m/c$
examples, and finding the differences between $x$ and these examples takes at most
$(dm/c) + (m/c)$ comparisons, where $dm/c$ comparisons are needed for computing the strength of
each differentiator and $m/c$ for determining how many differentiators should be returned. Since
there are at most $d$ differentiators, rewriting $D_j$ generates at most $d$ new decisions, and
distributing the examples of $D_j$ to these new decisions takes at most $d^2 m/c$ comparisons
(each example takes at most $d^2$ comparisons to find a new host decision). Thus, if we assume
$c > d$, the total number of comparisons needed to revise an existing decision list for a new
instance is:

$$cd + dm/c + m/c + d + d^2 m/c = O(cdm/c) = O(dm).$$
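As a sanity check on this bound, the five terms can be added up numerically. The sketch below is illustrative only: the function name is hypothetical, and it adds the assumption $c \le m$ (each decision holds at least one example), which the text does not state explicitly but which is needed for the term $cd$ to be bounded by $dm$.

```python
def revision_comparisons(c: int, d: int, m: int) -> float:
    """Upper bound on the comparisons needed to revise the decision list
    for one new instance, term by term as in the analysis above."""
    find_cover = c * d            # locate the covering decision
    strengths = d * m / c         # strength of each differentiator
    choose = m / c                # decide how many differentiators to return
    rewrite = d                   # at most d new decisions
    redistribute = d * d * m / c  # re-host the examples of the split decision
    return find_cover + strengths + choose + rewrite + redistribute


def within_linear_bound(c: int, d: int, m: int) -> bool:
    """Under the assumptions d < c <= m, each of the five terms is at most
    d * m, so the total is at most 5 * d * m."""
    return revision_comparisons(c, d, m) <= 5 * d * m
```

Each term is individually at most $dm$ when $d < c \le m$, which is why the sum collapses to $O(dm)$.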

Assume that the whole training set has $n$ instances. Then the total number of comparisons
needed to process all $n$ instances incrementally is

$$\sum_{m=1}^{n} O(dm) = O(dn^2).$$
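The step from the per-instance bound $O(dm)$ to the total can be spelled out by summing over $m = 1, \ldots, n$ and using the closed form of the arithmetic series:

```latex
\sum_{m=1}^{n} O(dm)
  = O\Bigl(d \sum_{m=1}^{n} m\Bigr)
  = O\Bigl(d \cdot \frac{n(n+1)}{2}\Bigr)
  = O(dn^2)
```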

This complexity of CDL4 is at least comparable with that of most non-incremental algorithms.
The complexity of CN2 is $dn^2$ [Clark and Niblett, 1989], but our experiments show that CDL4
may be faster by a constant factor. According to [Utgoff, 1989], ID3 (an earlier version of C4.5)
takes $O(nd^2)$ additions and $O(2^d)$ multiplications (one for each E-score calculation), and
ID5R (an earlier version of ITI) takes $O(ndb^d)$ additions and $O(nb^d)$ multiplications, where
$b$ is the maximum number of possible values for an attribute. CDL4 uses only comparison
operations, and its complexity has no exponential components.


8  The  Role  of  Active  Learning 

As we can see from the description and experiments, CDL4’s prediction accuracy on unseen
instances is sensitive to the order in which training instances are presented. This is because
CDL4 revises the current decision list based on the differences between the current instance
and previous examples of the similar concept. With different training orders, the differences
found by CDL4 may differ, and a different decision list will be built. For example, if
more representative instances (i.e., those that represent the critical differences between
target concepts) appear early in a training order, then CDL4 will build a shorter decision
list.

This effect of training orders on the accuracy of incremental learning is actually a two-edged
sword. On the one hand, one can exploit the training order to build more accurate
concept hypotheses, as we have attempted to do in our experiments. On the other hand,
there is a lack of methods for selecting good training orders, and manual selection is a
trial-and-error, time-consuming process. Indeed, Dieterich [1995] has concluded that this is
a common weakness of incremental learning algorithms.

Nevertheless, how to select training instances (also known as membership queries in
computational learning theory) for incremental learning is an active research topic (see, for
example, [Cornuejols et al., 1993]). Despite some known negative results (for example,
membership queries will not help with predicting CNF or DNF formulas [Angluin and Kharitonov,
1991]), there are many positive results regarding active learning. For example, it is known
that with membership queries (and equivalence queries) the class of DFA is learnable
[Angluin, 1987], and the class of decision trees is also learnable [Bshouty, 1993]. Furthermore,
there are some initial results on how to choose the training order for learning certain types
of concepts [Goldman et al., 1989], although their emphasis is on minimizing the number
of prediction mistakes in learning (regardless of how much time it takes before a mistake can
happen).

With an efficient incremental algorithm for learning decision lists such as CDL4,
we hope to find methods similar to Bshouty’s A-Algorithm that can choose instances
actively and make the learning of decision lists faster and more accurate. For example, in
our experiments with different training orders, we found that analysis of error distributions
is a good way to find a set of classes that are most “confusing,” so that we can choose an
order in which instances of these classes are presented adjacently. For applications where
instances can be freely selected, such methods can be used to learn the target concept quickly.
For applications where the set of training instances is fixed, such methods can be used to choose
the best training order of these instances so that the learner can make the most progress
towards the target concept. If such methods can indeed be found, then it is foreseeable that
incremental learning could outperform non-incremental learning in many aspects, such as
speed, accuracy, and noise tolerance.


9  Conclusions  and  Future  Work 

In this paper, we have presented a very efficient incremental algorithm, CDL4, for learning
decision lists. Experimental results have shown that, for serial tasks, CDL4 is much faster
than other incremental algorithms such as ITI and ID5R. This implies that, for serial tasks,
CDL4 is also much faster than any non-incremental algorithm that is rerun on the training
set each time a new instance arrives.

Most surprisingly, CDL4 in its serial mode can also perform as fast and as accurately as the
best-known non-incremental algorithms, such as C4.5 and CN2, on batch tasks. This result
has put incremental learning, for the first time to our knowledge, on the same performance
stage as non-incremental learning for both serial and batch tasks.

The  research  on  CDL4  has  also  revealed  several  important  future  research  directions. 
In  order  to  further  develop  incremental  learning  algorithms,  we  must  work  on  algorithms 
that  can  actively  select  training  instances  (or  training  orders  when  instances  are  limited). 
Positive  results  on  learning  decision  trees  by  membership  queries  have  already  provided  some 
evidence that this may be a feasible task.

To be truly incremental, we must also develop algorithms that use a fixed amount of space.
For example, CDL4 should not need to remember all previous examples, but only the
important characteristics of the example set associated with each decision. Some initial
work in this direction is already underway.

Finally, we point out that the sensitivity of incremental learning algorithms to
training orders should not be viewed as a negative aspect of incremental learning. Instead,
we should use it as a unique opportunity to further develop incremental learning algorithms
that can eventually outperform non-incremental learning in many aspects.